65 research outputs found

    A hardware mechanism to reduce the energy consumption of the register file of in-order architectures

    Get PDF
    This paper introduces an efficient hardware approach to reduce the register file energy consumption by turning unused registers into a low power state. Bypassing the register fields of the fetch instruction to the decode stage allows the identification of registers required by the current instruction (instruction predecode) and allows the control logic to turn them back on. They are put into the low-power state after the instruction use. This technique achieves an 85% energy reduction with no performance penalty

    DotHash: Estimating Set Similarity Metrics for Link Prediction and Document Deduplication

    Full text link
    Metrics for set similarity are a core aspect of several data mining tasks. To remove duplicate results in a Web search, for example, a common approach looks at the Jaccard index between all pairs of pages. In social network analysis, a much-celebrated metric is the Adamic-Adar index, widely used to compare node neighborhood sets in the important problem of predicting links. However, with the increasing amount of data to be processed, calculating the exact similarity between all pairs can be intractable. The challenge of working at this scale has motivated research into efficient estimators for set similarity metrics. The two most popular estimators, MinHash and SimHash, are indeed used in applications such as document deduplication and recommender systems where large volumes of data need to be processed. Given the importance of these tasks, the demand for advancing estimators is evident. We propose DotHash, an unbiased estimator for the intersection size of two sets. DotHash can be used to estimate the Jaccard index and, to the best of our knowledge, is the first method that can also estimate the Adamic-Adar index and a family of related metrics. We formally define this family of metrics, provide theoretical bounds on the probability of estimate errors, and analyze its empirical performance. Our experimental results indicate that DotHash is more accurate than the other estimators in link prediction and detecting duplicate documents with the same complexity and similar comparison time

    Enhancing the Privacy of Machine Learning via faster arithmetic over Torus FHE

    Get PDF
    The increased popularity of Machine Learning as a Service (MLaaS) makes the privacy of user data and network weights a critical concern. Using Torus FHE (TFHE) offers a solution for privacy-preserving computation in a cloud environment by allowing computation directly over encrypted data. However, software TFHE implementations of cyphertext-cyphertext multiplication needed when both input data and weights are encrypted are either lacking or are too slow. This paper proposes a new way to improve the performance of such multiplication by applying carry save addition. Its theoretical speedup is proportional to the bit width of the plaintext integer operands. This also speeds up multi-operand summation. A speedup of 15x is obtained for 16-bit multiplication on a 64-core processor, when compared to previous results. Multiplication also becomes more than twice as fast on a GPU if our approach is utilized. This leads to much faster dot product and convolution computations, which combine multiplications and a multi-operand sum. A 45x speedup is achieved for a 16-bit, 32-element dot product and a ~30x speedup for a convolution with a 32x32 filter size

    A distributed processor state management architecture for large-window processors

    Get PDF
    Processor architectures with large instruction windows have been proposed to expose more instruction-level parallelism (ILP) and increase performance. Some of the proposed architectures replace a re-order buffer (ROB) with a check-pointing mechanism and an out-of-order release of processor resources. Check-pointing, however, leads to an imprecise processor state recovery on mis-predicted branches and exceptions and re-execution of correct-path instructions after state recovery. It also requires large register files complicating renaming, allocation and release of physical registers. This paper proposes a new processor architecture called a Multi-State Processor (MSP). The MSP does not use check-pointing, avoids the above-mentioned problems, and has a fast, distributed state recovery mechanism. The MSP uses a novel register management architecture allowing implementation of large register files with simpler and more scalable register allocation, renaming, and release. It is also key to precise processor state recovery mechanism. The MSP is shown to improve IPC by 14%, on average, for integer SPEC CPU2000 benchmarks compared to a check-pointing based mechanism ([2]) when a fast and simple branch predictor is used. With a very aggressive branch predictor the IPC improvement is 1%, on average, and 3% if some of the programs are optimized for the MSP. The MSP also reduces the average number of executed instructions by 16.5% (12% for the aggressive branch predictor), mostly due to precise state recovery. This improves the MSP processor energy efficiency even though it uses a larger register file.Peer ReviewedPostprint (published version

    Instruction Cache Prefetching Using Multilevel Branch Prediction

    No full text
    This paper presents an instruction cache prefetching mechanism capable of prefetching past branches in multiple-issue processors. Such processors at high clock rates often use small instruction caches which have significant miss rates. Prefetching from secondary cache can hide the instruction cache miss penalties but only if initiated sufficiently far ahead of the current program counter. Existing instruction cache prefetching methods are strictly sequential and cannot do that due to their inability to prefetch past branches. By keeping branch history and branch target addresses we predict a future PC several branches past the current branch. We describe a possible prefetching architecture and evaluate its accuracy, the impact of the instruction prefetching on performance, and its interaction with sequential prefetching. For a 4issue processor and a cache architecture patterned after the DEC Alpha-21164 we show that our prefetching unit can be more effective than sequential prefetching..

    Decoupled Access DRAM Architecture

    No full text
    This paper discusses an approach to reducing memory latency in future systems. It focuses on systems where a single chip DRAM/processor will not be feasible even in 10 years, e.g. systems requiring a large memory and/or many CPU's. In such systems a solution needs to be found to DRAM latency and bandwidth as well as to inter-chip communication. Utilizing the projected advances in chip I/O bandwidth we propose to implement a decoupled access-execute processor where the access processor is placed in memory. Aprogram is compiledtorunasacomputational process and several access processes with the latter executing in the DRAM processors. Instruction set extensions are discussedto support this paradigm. Using multi-level branch prediction the access processor stays ahead of the execute processor and keeps the latter supplied with data. The system reduces latency by moving address computation to memory and thus avoiding sending address to memory by the computational processor. This and the fetchahead capabilities of the access processor arecombined with multiple DRAM "streaming" to improve performance. DRAM caching is assumedtobeused to assist in this as well

    Stride-directed Prefetching for Secondary Caches

    No full text
    skim @ aus tin.ibm. co
    corecore